
(ICML 2016) Meta-learning with memory-augmented neural networks

Keyword [MANN (Memory-Augmented Neural Network)] [Memory] [NTM (Neural Turing Machines)]

Santoro, Adam, Bartunov, Sergey, Botvinick, Matthew, Wierstra, Daan, and Lillicrap, Timothy. Meta-learning with memory-augmented neural networks. In Proceedings of The 33rd International Conference on Machine Learning, pp. 1842–1850, 2016.



1. Overview


1.1. Motivation

  • When new data is encountered, conventional models must inefficiently relearn their parameters to adequately incorporate the new information without catastrophic interference

This paper proposes using a Memory-Augmented Neural Network (MANN), such as the NTM, which can:

  • quickly encode and retrieve new information
  • rapidly assimilate new data and leverage it to make accurate predictions after only a few samples, without re-training (one-shot learning)

It also introduces a new method, Least Recently Used Access (LRUA), for accessing an external memory. The resulting model is able to:

  1. slowly learn an abstract method for obtaining useful representations of raw data, via gradient descent
  2. rapidly bind never-before-seen information after a single presentation, via an external memory module

1.2. Meta-Learning Task Methodology



  • inputs are presented with time-offset labels (the label for the image at time t-1 arrives together with the image at time t), and the class-label bindings are shuffled from dataset to dataset (episode to episode)



  • the model must learn to hold data samples in memory until the appropriate labels are presented at the next time step, after which sample-class information can be bound and stored for later use

1.2.1. Input

[batch size, length of an episode, h*w*c + class_nb]

  • for example, [16, 50, 20*20*1+5] for the Omniglot dataset in the paper
  • the class_nb part at time t is the one-hot label of the input image at time t-1

1.2.2. Output

[batch size, length of an episode, class_nb of an episode]

  • for example, [16, 50, 5]: only 5 classes appear in an episode of length 50 (50 input images)

For a given episode, ideal performance involves a random guess for the first presentation of a class, and use of memory to achieve perfect accuracy thereafter.
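
To make the input/output layout concrete, here is a minimal sketch (not the paper's code) of how an episode with time-offset labels could be built; the names make_episode and images_by_class are illustrative, and images are assumed to be pre-flattened 20x20 arrays keyed by class.

```python
import numpy as np

def make_episode(images_by_class, n_classes=5, episode_len=50, batch_size=16):
    """images_by_class: dict mapping class id -> array of flattened 20*20 images."""
    x = np.zeros((batch_size, episode_len, 20 * 20 * 1 + n_classes), dtype=np.float32)
    y = np.zeros((batch_size, episode_len, n_classes), dtype=np.float32)
    for b in range(batch_size):
        # sample the episode's classes; the class -> label binding is reshuffled per episode
        classes = np.random.choice(list(images_by_class), n_classes, replace=False)
        prev_label = np.zeros(n_classes, dtype=np.float32)    # no label at t = 0
        for t in range(episode_len):
            c = np.random.randint(n_classes)                  # episode-local label
            samples = images_by_class[classes[c]]
            x[b, t, :400] = samples[np.random.randint(len(samples))]  # image at time t
            x[b, t, 400:] = prev_label                        # label of the image at t-1
            y[b, t, c] = 1.0                                  # target for time t
            prev_label = y[b, t]
    return x, y   # x: (16, 50, 20*20*1+5), y: (16, 50, 5)
```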

1.3. Neural Turing Machines

Memory encoding and retrieval in a NTM external memory module is rapid.



1.3.1. Memory Read



  • r. the read vector retrieved from memory
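
Written out in the paper's notation, the controller emits a key k_t that is compared to each memory row M_t(i) by cosine similarity; the read weights are a softmax over these similarities, and the read vector is their weighted sum:

```latex
K\!\left(\mathbf{k}_t, \mathbf{M}_t(i)\right)
  = \frac{\mathbf{k}_t \cdot \mathbf{M}_t(i)}
         {\lVert \mathbf{k}_t \rVert \, \lVert \mathbf{M}_t(i) \rVert},
\qquad
w_t^r(i) = \frac{\exp\!\big(K(\mathbf{k}_t, \mathbf{M}_t(i))\big)}
                {\sum_j \exp\!\big(K(\mathbf{k}_t, \mathbf{M}_t(j))\big)},
\qquad
\mathbf{r}_t = \sum_i w_t^r(i)\, \mathbf{M}_t(i)
```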

1.4. Least Recently Used Access

  • a pure content-based memory writer that writes memories to either the least used memory location or the most recently used memory location

1.4.1. Usage Weights



  • wu. usage weights
  • wr. read weights
  • ww. write weights
  • γ. decay parameter, 0.95 in this paper
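
With these symbols, the usage-weight update from the paper is:

```latex
\mathbf{w}_t^u \leftarrow \gamma\, \mathbf{w}_{t-1}^u + \mathbf{w}_t^r + \mathbf{w}_t^w
```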

1.4.2. Least-Used Weights



  • wlu. least-used weights
  • m(v, n). nth smallest element of the vector v
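
The least-used weights are computed elementwise from the usage weights (with n set to the number of reads to memory):

```latex
w_t^{lu}(i) =
\begin{cases}
0, & \text{if } w_t^u(i) > m(\mathbf{w}_t^u, n) \\
1, & \text{if } w_t^u(i) \le m(\mathbf{w}_t^u, n)
\end{cases}
```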

1.4.3. Write Weights



  • σ. sigmoid function
  • α. learnable gate parameter
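
The write weights interpolate, via the learnable gate, between the previous read weights and the previous least-used weights:

```latex
\mathbf{w}_t^w \leftarrow \sigma(\alpha)\, \mathbf{w}_{t-1}^r + \big(1 - \sigma(\alpha)\big)\, \mathbf{w}_{t-1}^{lu}
```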

1.4.4. Memory Write



  • prior to writing to memory, the least used memory location is set to zero
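
Each memory row is then updated with the write weights and the key emitted by the controller:

```latex
\mathbf{M}_t(i) \leftarrow \mathbf{M}_{t-1}(i) + w_t^w(i)\, \mathbf{k}_t
```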

1.5. Access Module



Code

Input. (16, 50, 20*20)
(a) reshape to (50, 16, 20*20)
(b) at each time step, a (16, 20*20) slice enters the Access Module, which outputs

  • M_t (16, 128, 40)
  • c_t (16, 200)
  • h_t (16, 200)
  • r_t (16, 4*40)
  • wr_t (16, 4, 128)
  • wu_t (16, 128)

(c) after repeating this for all 50 time steps, we collect in total

  • M (50, 16, 128, 40)
  • c (50, 16, 200)
  • h (50, 16, 200)
  • r (50, 16, 4*40)
  • wr (50, 16, 4, 128)
  • wu (50, 16, 128)

(d) concatenate r and h to get (50, 16, 200+160)
(e) (50, 16, 360) · (360, 5) → (50, 16, 5) → (16, 50, 5)
(f) compute the loss, backpropagate, and update the parameters
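
As a rough PyTorch-style sketch of steps (a)-(f) above (not the paper's code): AccessModule, i.e. the LSTM controller plus LRUA memory, and classifier are assumed to be defined elsewhere, and their interfaces here (initial_state, the (r_t, h_t, state) return) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

B, T, IN = 16, 50, 20 * 20          # batch, episode length, per-step input size
H, N, W, R, C = 200, 128, 40, 4, 5  # controller hidden, memory slots, slot width, read heads, classes

def forward_episode(x, y, access_module, classifier):
    """x: (B, T, IN) episode inputs, y: (B, T, C) one-hot targets."""
    x = x.transpose(0, 1)                                 # (a) -> (T, B, IN)
    state = access_module.initial_state(B)                # M_0, c_0, h_0, r_0, wr_0, wu_0
    logits = []
    for t in range(T):                                    # (b)/(c) step through the episode
        r_t, h_t, state = access_module(x[t], state)      # r_t: (B, R*W), h_t: (B, H)
        o_t = classifier(torch.cat([h_t, r_t], dim=-1))   # (d)/(e) (B, H+R*W) -> (B, C)
        logits.append(o_t)
    logits = torch.stack(logits, dim=1)                   # (B, T, C)
    loss = F.cross_entropy(logits.reshape(-1, C),         # (f) loss over all time steps
                           y.reshape(-1, C).argmax(dim=-1))
    return logits, loss
```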



2. Experiments


2.1. DataSet

2.1.1. Omniglot

  • contains over 1600 separate classes with only a few examples per class, aptly leading to it being called the "transpose" of MNIST
  • data augmentation is applied: random translations and rotations
  • new classes are created through 90°, 180° and 270° rotations of existing classes
  • 1200 classes for training, 423 classes for testing
  • images are downscaled to 20x20
  • the classes in testing set are different from classes in training set
  • In testing set, each episode contains unique classes.

2.2. Performance



  • x-axis. training episode; when a new episode starts, the memory is wiped (set to 0)
  • y-axis. testing performance
  • n-th instance. within an episode, the n-th time a sample of a given class is presented
  • five-character string labels. to shorten the one-hot label vector, each label is represented as a string of length 5, where each of the 5 positions takes one of 5 possible characters, so 5^5 = 3125 classes can be represented



  • MANN with the standard NTM access module performs worse than MANN with LRU Access

2.3. Persistent Memory Inference

  • As each episode contains unique classes, wiping the memory between episodes is important
  • Performance degrades when the memory is not wiped between episodes


2.4. Curriculum Training

  • the model was first tasked to classify fifteen classes per episode
  • every 10,000 training episodes thereafter, the maximum number of classes presented per episode was incremented by one